All Questions
Tagged with bigdata, apache-spark
28 questions
0 votes
0 answers
14 views
Stuck on loading parquet files recursively of varying size with Spark
I am using Spark on Scala via an Almond kernel for Jupyter to load several parquet files of varying sizes. I have a single worker with 10 cores and a memory allowance of 10 GB. When I execute the ...
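For loading parquet files scattered across subdirectories, a minimal sketch (assuming Spark 3.0+ and a hypothetical path; the 10-core/10 GB worker matches the question's setup):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("recursive-parquet")
  .master("local[10]")
  .config("spark.executor.memory", "10g")
  .getOrCreate()

// recursiveFileLookup (Spark 3.0+) descends into all subdirectories;
// Spark splits files of varying size into partitions automatically.
val df = spark.read
  .option("recursiveFileLookup", "true")
  .parquet("/data/parquet/root") // hypothetical path

df.printSchema()
```

Without `recursiveFileLookup`, glob patterns such as `/data/parquet/root/*/*.parquet` are an older alternative for nested layouts.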
0 votes
0 answers
37 views
How to determine the best number of cores and memory for Spark job
How can we determine the optimal number of cores and memory for running Spark jobs based on data volume, the number of jobs, and their frequency? From what I've read, we can determine the number of ...
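One widely cited heuristic (an assumption here, not from the question itself) is about 5 cores per executor, with one core and roughly 1 GB per node reserved for the OS and Hadoop daemons. A back-of-envelope sizing sketch, with hypothetical cluster numbers:

```scala
// Hypothetical cluster specs; adjust for your environment.
val nodes = 4
val coresPerNode = 16
val memPerNodeGb = 64

val usableCores = coresPerNode - 1        // reserve 1 core for OS/daemons
val usableMemGb = memPerNodeGb - 1        // reserve ~1 GB for OS/daemons
val coresPerExecutor = 5                  // common HDFS-throughput heuristic
val executorsPerNode = usableCores / coresPerExecutor
val totalExecutors = nodes * executorsPerNode - 1  // minus 1 for the driver
val memPerExecutorGb =
  (usableMemGb / executorsPerNode * 0.9).toInt     // leave ~10% for overhead

println(s"--num-executors $totalExecutors " +
  s"--executor-cores $coresPerExecutor " +
  s"--executor-memory ${memPerExecutorGb}g")
```

Data volume and job frequency then mainly inform how many such jobs can run concurrently, not the per-executor shape.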
3 votes
0 answers
168 views
Clustering large set of images
I've got some big datasets of images (a few million each), and I would like to cluster them according to images' visual similarities. I've extracted a feature vector for each image; the space of ...
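With feature vectors already extracted, a distributed clustering pass in Spark MLlib is one option. A sketch, assuming a DataFrame `featuresDf` with a Vector column named `features` (both names are hypothetical):

```scala
import org.apache.spark.ml.clustering.KMeans

// Distributed k-means over millions of pre-extracted image feature vectors.
val kmeans = new KMeans()
  .setK(1000)                  // number of clusters; tune for your data
  .setFeaturesCol("features")
  .setSeed(42L)

val model = kmeans.fit(featuresDf)
val clustered = model.transform(featuresDf) // adds a "prediction" column
```

For high-dimensional image embeddings, reducing dimensionality first (e.g., with PCA) often makes the distance computations cheaper and the clusters tighter.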
4 votes
1 answer
1k views
What is the main difference between Hadoop and Spark? [closed]
I recently read the following about Hadoop vs. Spark: Insist upon in-memory columnar data querying. This was the killer-feature that let Apache Spark run in seconds the queries that would take Hadoop ...
2 votes
1 answer
316 views
Spark: How to run PCA parallelized? Only one thread used
I use pySpark and set my configuration as follows: ...
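A Scala sketch of the equivalent pipeline (the question uses pySpark; the `ml.feature.PCA` API is analogous), assuming a DataFrame `df` with a Vector column `features`:

```scala
import org.apache.spark.ml.feature.PCA

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(10)                    // number of principal components

// Note: the covariance/Gramian is computed in a distributed way, but the
// final eigendecomposition runs on the driver, which can appear as a
// single busy thread even on a multi-core cluster.
val model = pca.fit(df)
val reduced = model.transform(df)
```

So a single active thread during part of the fit is expected behavior for this implementation, not necessarily a misconfiguration.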
1 vote
0 answers
34 views
What is important for Pharmaceutical companies to answer with Big Data Analysis?
I am a data scientist with some background in biology (genetics). I have been asked to give a talk to our customers from the pharmaceutical industry. I should show them how they can benefit from Big ...
0 votes
1 answer
125 views
Creating more than one worker nodes for local windows machine [closed]
I am using a Windows laptop, on which I installed Apache Spark. I want to measure Spark performance by changing Spark components, so I would like to create more than one worker node ...
0 votes
1 answer
527 views
How to run Spark python code in Jupyter Notebook via command prompt
I am trying to import a data frame into Spark using Python's pyspark module. For this, I used Jupyter Notebook and executed the code shown in the screenshot below. After that, I want to run this in CMD ...
2 votes
0 answers
293 views
How to create tensors in spark?
I have the following data stored in HDFS: each row has three columns (id, date, item), meaning that a person with a particular id bought a particular item on a particular date. The dataset has billions ...
1 vote
0 answers
41 views
Is there a way to use a pom.xml file to update spark configuration?
I am trying to update my Spark configuration to solve some dependency problems. A pom.xml file seems to be useful for this purpose. I am using a Spark Docker image. ...
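For dependency conflicts, the usual Maven route is to declare the Spark artifacts explicitly in pom.xml. A hedged sketch (the version and Scala suffix below are assumptions; they must match the Spark build inside the Docker image):

```xml
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>3.5.1</version>
  <!-- "provided": the image already ships Spark, so don't bundle it -->
  <scope>provided</scope>
</dependency>
```

Note that pom.xml controls which jars end up on the application classpath; runtime settings such as memory or shuffle behavior still belong in spark-defaults.conf or SparkSession config, not in the POM.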
1 vote
0 answers
513 views
Spark Scala concatenate 2 different data frames
I have two different Spark DataFrames and I want to concatenate them column-wise, with no join operations. How can I do it using Scala?
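Since DataFrames have no positional alignment, one common workaround is to attach a stable row index to each side and pair rows on it. A sketch, assuming both DataFrames have the same row count and distinct column names:

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Concatenate two DataFrames column-wise by pairing rows on an index.
// zipWithIndex assigns contiguous, order-preserving indices, unlike
// monotonically_increasing_id, whose values are not contiguous.
def concatColumns(left: DataFrame, right: DataFrame): DataFrame = {
  val spark = left.sparkSession
  def withIndex(df: DataFrame): DataFrame = {
    val indexed = df.rdd.zipWithIndex.map { case (row, idx) =>
      Row.fromSeq(row.toSeq :+ idx)
    }
    val schema = StructType(
      df.schema.fields :+ StructField("_row_idx", LongType, nullable = false))
    spark.createDataFrame(indexed, schema)
  }
  withIndex(left)
    .join(withIndex(right), Seq("_row_idx"))
    .drop("_row_idx")
}
```

The equi-join on the index is the one unavoidable shuffle; there is no built-in zero-join column concatenation for distributed DataFrames.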
3 votes
2 answers
836 views
Navigating the jungle of choices for scalable ML deployment
I have prototyped a machine learning (ML) model on my local machine and would like to scale it to both train and serve on much larger datasets than could be feasible on a single machine. (The model ...
2 votes
0 answers
2k views
Alternative to Apache Spark? [closed]
I have been looking for a comprehensive alternative to Apache Spark for Big Data analytics/machine learning and couldn't find one. The ones I have come across are: Apache Flink, Google DataFlow ...
2 votes
1 answer
73 views
Does storing a file in HDFS parallelize it for Spark?
For Spark's RDD operations, data must be in the form of an RDD or be parallelized using: ParallelizedData = sc.parallelize(data) My question is: if I store data in ...
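The short answer is that `parallelize` is only for collections that already live in the driver's memory; data read from HDFS arrives distributed. A sketch (the HDFS path is hypothetical):

```scala
// For a local, in-driver collection, parallelize distributes it:
val localRdd = sc.parallelize(Seq(1, 2, 3, 4), numSlices = 4)

// Data in HDFS is already split into blocks across the cluster, and
// sc.textFile creates roughly one partition per HDFS block
// (128 MB by default), so no parallelize call is needed:
val hdfsRdd = sc.textFile("hdfs:///data/input.txt")
println(hdfsRdd.getNumPartitions)
```

So storing a file in HDFS does give Spark parallel input splits; `parallelize` would only come into play after collecting the data to the driver, which defeats the purpose.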